In May 2020, the Georgia Department of Public Health posted the following plot to illustrate the number of confirmed COVID-19 cases in their hardest-hit counties over a two-week period. Health officials claimed that the plot provided evidence that COVID-19 cases were decreasing and made the argument for reopening the state.

The plot was heavily criticized by the statistical community and several media outlets for its deceptive portrayal of COVID-19 trends in Georgia. Whether the end result was due to malicious intent or simply poor judgment, it is incredibly irresponsible to publish data visualizations that obscure and distort the truth.

Data visualization is an incredibly powerful tool that can affect health policy decisions. Ensuring they are easy to interpret, and more importantly, showcase accurate insights from data is paramount for scientific transparency and the health of individuals. For this assignment you are tasked with reproducing COVID-19 visualizations and tables published by the New York Times. Specifically, you will attempt to reproduce the following for January 12th, 2022:

  1. New cases as a function of time with a rolling average plot - the first plot on the page (you don’t need to recreate the colors or theme)
  2. Table of cases and deaths - the first table on the page
  3. The county-level map for previous week (‘Hot spots’) - the second plot on the page (only the ‘Hot Spots’ plot)
  4. Table of cases by state - the second table on the page (do not need to include per 100,000, 14-day change, or fully vaccinated columns columns)

Data for cases and deaths can be downloaded from this NYT GitHub repository (use us-counties.csv). Data for county populations can be downloaded from The US Census Bureau. We will provide code for wrangling population data and date to plot the map in Task #3.

The project must be submitted in the form of a Jupyter notebook or RMarkdown file and corresponding compiled/knitted PDF, with commented code and text interspersed, including a brief critique of the reproducibility of each plot and table. All project documents must be uploaded to a GitHub repository each student will create within the reproducible data science organization. The repository must also include a README file describing the contents of the repository and how to reproduce all results. You should keep in mind the file and folder structure we covered in class and make the reproducible process as automated as possible.

Tips:

cases = c(13, 15, 18, 22, 29, 39, 59, 61, 62, 67, 74, 89, 108, 122)
new_cases = cases - lag(cases)
new_cases
##  [1] NA  2  3  4  7 10 20  2  1  5  7 15 19 14
library(zoo)

new_cases_7dayavg = rollmean(new_cases, k = 7, fill = NA)
new_cases_7dayavg
##  [1]       NA       NA       NA       NA 6.857143 6.714286 7.000000 7.428571
##  [9] 8.571429 9.857143 9.000000       NA       NA       NA

Tasks

Task #1

Create the new cases as a function of time with a rolling average plot - the first plot on the page (you don’t need to recreate the colors or theme).

Code to read in the data and get you started.

# Read in NYT data
# Note that this is read in using a URL but the csv can also be saved and used.
nyt <- read.csv(url("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"))
dim(nyt)
## [1] 2502832       6
head(nyt)
##         date    county      state  fips cases deaths
## 1 2020-01-21 Snohomish Washington 53061     1      0
## 2 2020-01-22 Snohomish Washington 53061     1      0
## 3 2020-01-23 Snohomish Washington 53061     1      0
## 4 2020-01-24      Cook   Illinois 17031     1      0
## 5 2020-01-24 Snohomish Washington 53061     1      0
## 6 2020-01-25    Orange California  6059     1      0

My Code for task 1

# create data set for task 1
df_task1 <- nyt %>% 
   group_by(date) %>% arrange(date) %>% summarise(
    # calculate cumulative cases and deaths for the U.S. on each day
    # ignore missing values
    all_cases = sum(cases, na.rm = T),
    all_deaths = sum(deaths, na.rm = T)) %>% 

  mutate(
     # calculate new cases and deaths each day
    new_cases = all_cases - lag(all_cases),
    new_deaths = all_deaths - lag(all_deaths)) %>%
  
    # separate the date column into year, month and day in order to filter 
  # out the date during Mar. 2020 -- Jan. 2021
   separate(date, into = c("year", "month", "day")) %>%
  filter( ((year == "2020")&(!(month %in% c("01", "02")))) | 
            ((year == "2021")&(month == "01")&(day <= 18)) ) %>%
    
     # calculate a seven-day rolling average, and return NA for days where 
     # the seven-day rolling average can't be calculated
    mutate(new_cases_7dayavg = rollmean(new_cases, k = 7, fill = NA),
           new_deaths_7dayavg = rollmean(new_deaths, k = 7, fill = NA)
      ) %>% 
  
  # put the date back again
  unite(col = date, year, month, day, sep = "-") 


## create the plot 
plot_task1 <- ggplot(df_task1) +
  
  # create the histogram for new cases each day
  geom_histogram(aes(x = date, y = new_cases, fill = "pink", alpha = 0.6),
                stat = "identity") +
  
  # create the 7-day rolling average
  geom_line(aes(x = date, y = new_cases_7dayavg, group = 1), color = "firebrick3",
              method = "loess", se = F, size = 0.8) +
  
  # create title and label x- and y-axis
  labs(title = "Coronavirus in the U.S.: Latest Map and Case Count",
       subtitle = "Updated January 18, 2021", 
       y = "Cases", x = "Date") +
  scale_x_discrete(
    breaks = c("2020-03-01", "2020-04-01", "2020-05-01", "2020-06-01", "2020-07-01",
               "2020-08-01", "2020-09-01", "2020-10-01", "2020-11-01", "2020-12-01",
               "2021-01-01"),
    labels = c("Mar.2020", "Apr.", "May", "Jun.", "Jul.", "Aug.", "Sept.", "Oct.",
               "Nov.", "Dec.", "Jan.2021"), expand = c(0,0)) +
  scale_y_continuous(
    breaks = c(0, 100000, 200000, 300000), 
    labels = c("0", "100,000", "200,000", "300,000 cases"),
    expand = c(0,0), limits = c(0, 350000)
  ) +
  theme_bw() +
  theme(panel.grid.major.x = element_blank(),
        panel.grid.major.y = element_line(linetype = "dashed", color = "grey", 
                                          size = 0.3),
        axis.ticks.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        panel.border = element_blank(),
        axis.line.x = element_line(colour = "gray", size = 0.3, linetype = "solid"),
        legend.position = "none") +
  
  # add text annotation about the line and histogram
  geom_text(x = "2020-07-23", 
            y = df_task1$new_cases[df_task1$date == "2020-07-23"]+20000,
            label = "7-day average") +
  geom_text(x = "2020-12-10", 
            y = 300000,
            label = "New cases")  +

  geom_segment(x = "2020-07-23", xend = "2020-07-23",
               y = df_task1$new_cases_7dayavg[df_task1$date == "2020-07-23"],
               yend = df_task1$new_cases[df_task1$date == "2020-07-23"]+10000) +
  geom_segment(x = "2021-01-01", xend = df_task1$date[which.max(df_task1$new_cases)],
               y = 300000, yend = 300000)

plot_task1

# save the plot that I reproduced
ggsave(filename = "task1.png", plot = plot_task1, 
       path = "reproduced_figures", 
       width = 20,
       height = 10,
       units = c("cm"))

Task #2

Create the table of cases and deaths - the first table on the page, right below the figure you created in task #1. You don’t need to include tests or hospitalizations.

My Code for task 2

# create data set for task 2 - new cases and death on Jan. 17, 2021
df_task2_jan17 <- df_task1 %>% filter(date == "2021-01-17") %>% 
  summarize(cases = sum(new_cases),
            deaths = sum(new_deaths))

# create data set for task 2 - cumulative cases and deaths on Jan. 17, 2021
df_task2_total <- nyt %>% filter(date == "2021-01-17") %>%
  summarize(cases = sum(cases, na.rm = T), deaths = sum(deaths, na.rm = T))

# create data set for task 2 - 14-day change on Jan. 17, 2021
df_task2_14dchange <- rbind(
  # 14 day change = 7-day change on (2021-01-14 - 2020-12-30)/2021-01-14
  paste(round(100*(df_task1$new_cases_7dayavg[df_task1$date == "2021-01-14"] - 
  df_task1$new_cases_7dayavg[df_task1$date =="2020-12-30"])/
  df_task1$new_cases_7dayavg[df_task1$date == "2021-01-14"],0), "%", sep = ""),
  
  paste(round(100*(df_task1$new_deaths_7dayavg[df_task1$date == "2021-01-14"] - 
  df_task1$new_deaths_7dayavg[df_task1$date =="2020-12-30"])/
  df_task1$new_deaths_7dayavg[df_task1$date == "2021-01-14"],0), "%", sep = "")
)


# create table for task 2
df_task2 <- data.frame(total = format(t(df_task2_total), format = "d", big.mark=","),
                       jan17 = format(t(df_task2_jan17), format ="d",big.mark=","),
                       change = df_task2_14dchange)
colnames(df_task2) <- c("Total reported", "On Jan. 17", "14-Day change")
rownames(df_task2) <- c("Cases", "Deaths")

table_task2 <- df_task2 %>% kable("html") %>% kable_classic() %>%
  row_spec(0, bold = T) %>% column_spec(1, bold = T) %>%
  kable_styling(full_width = F)

table_task2
Total reported On Jan. 17 14-Day change
Cases 23,986,856 170,094 6%
Deaths 397,624 1,730 21%
# save the table
save_kable(table_task2, "reproduced_figures/task2.png", zoom = 1.5)

Task #3

Create the county-level map for previous week (‘Hot spots’) - the second plot on the page (only the ‘Hot Spots’ plot). You don’t need to include state names and can use a different color palette.

Code to wrangle county population data and map data.

# Get US county populations from census
county_pop <- as.data.frame(data.table::fread("https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/counties/totals/co-est2019-alldata.csv"))
# Wrangle data and pull population estimates from 2019
county_pop <- county_pop %>%
  mutate(STNAME = str_to_lower(STNAME),
         CTYNAME = str_replace(CTYNAME, "\\sCounty|\\sParish", ""),
         CTYNAME = str_replace(CTYNAME, "\\.", ""),
         CTYNAME = str_to_lower(CTYNAME),) %>%
  select(STNAME, CTYNAME, POPESTIMATE2019) %>%
  rename(region = STNAME, subregion = CTYNAME, population = POPESTIMATE2019)
head(county_pop)
##    region subregion population
## 1 alabama   alabama    4903185
## 2 alabama   autauga      55869
## 3 alabama   baldwin     223234
## 4 alabama   barbour      24686
## 5 alabama      bibb      22394
## 6 alabama    blount      57826
# Load map data (US counties)
library(usmap)
library(maps)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map
counties <- map_data("county")
head(counties)
##        long      lat group order  region subregion
## 1 -86.50517 32.34920     1     1 alabama   autauga
## 2 -86.53382 32.35493     1     2 alabama   autauga
## 3 -86.54527 32.36639     1     3 alabama   autauga
## 4 -86.55673 32.37785     1     4 alabama   autauga
## 5 -86.57966 32.38357     1     5 alabama   autauga
## 6 -86.59111 32.37785     1     6 alabama   autauga
# Merge map data frame and population data frame
counties <- counties %>% 
  left_join(county_pop, by = c("region", "subregion"))

head(counties)
##        long      lat group order  region subregion population
## 1 -86.50517 32.34920     1     1 alabama   autauga      55869
## 2 -86.53382 32.35493     1     2 alabama   autauga      55869
## 3 -86.54527 32.36639     1     3 alabama   autauga      55869
## 4 -86.55673 32.37785     1     4 alabama   autauga      55869
## 5 -86.57966 32.38357     1     5 alabama   autauga      55869
## 6 -86.59111 32.37785     1     6 alabama   autauga      55869
# Wrangle NYT data to match counties data frame.

nyt_task3 <- nyt %>% rename(region = state,
                      subregion = county) %>%
  mutate(region = str_to_lower(region),
         subregion = str_to_lower(subregion),
         subregion = str_replace(subregion, "\\.", ""),) 

head(nyt_task3)
##         date subregion     region  fips cases deaths
## 1 2020-01-21 snohomish washington 53061     1      0
## 2 2020-01-22 snohomish washington 53061     1      0
## 3 2020-01-23 snohomish washington 53061     1      0
## 4 2020-01-24      cook   illinois 17031     1      0
## 5 2020-01-24 snohomish washington 53061     1      0
## 6 2020-01-25    orange california  6059     1      0
# Calculate average daily cases for the plot - remember to group by region, subregion, and date. Then filter to only include the date 2022-01-12.

# Your code here

My Code for task 3: calculate daily average case per 100,000 people for each subregion on 2022-01-12

nyt_task3_1 <- nyt_task3 %>% 
  
  # calculate new daily cases
  group_by(region, subregion) %>% 
  
  # calculate daily cases
  arrange(date) %>%
  mutate(new_cases = cases - lag(cases)) %>%

  # filter data for the past week of 2022-01-12
  filter(date %in% c("2022-01-06", "2022-01-07", "2022-01-08", "2022-01-09",
                     "2022-01-10", "2022-01-11", "2022-01-12") ) %>% 
  
  # merge population info
  left_join(county_pop,  by = c("region", "subregion")) %>%
  
  # calculate daily average cases for 100,000 people for each subregion
  mutate(avg_cases = mean(new_cases, na.rm = T)/population*100000) %>%
  filter(date == "2022-01-12")
  
head(nyt_task3_1)
## # A tibble: 6 × 9
## # Groups:   region, subregion [6]
##   date       subregion region   fips cases deaths new_cases population avg_cases
##   <chr>      <chr>     <chr>   <int> <int>  <int>     <int>      <int>     <dbl>
## 1 2022-01-12 autauga   alabama  1001 12180    162        78      55869      180.
## 2 2022-01-12 baldwin   alabama  1003 44353    603       457     223234      195.
## 3 2022-01-12 barbour   alabama  1005  4437     83        62      24686      232.
## 4 2022-01-12 bibb      alabama  1007  5046     95        39      22394      233.
## 5 2022-01-12 blount    alabama  1009 12068    202       103      57826      141.
## 6 2022-01-12 bullock   alabama  1011  1922     46        21      10101      240.

My Code for task 3: merge daily average per 100k people with the counties data frame

# Merge your updated nyt data frame and counties data frame by joining by region and subregion. 

nyt_task3_2 <- nyt_task3_1 %>% select(!c(fips, deaths, population)) %>%
  left_join(counties, by = c("region", "subregion")) %>% 
  mutate(avg_cases_group = ceiling(avg_cases/10),
         avg_cases_group = ifelse(avg_cases > 70 & avg_cases <= 85, 
                                  8, avg_cases_group),
         avg_cases_group = ifelse(avg_cases > 85 & avg_cases <= 100, 
                                  9, avg_cases_group),
          avg_cases_group = ifelse(avg_cases > 100 & avg_cases <= 175,
                                   10, avg_cases_group),
         avg_cases_group = ifelse(avg_cases > 175 & avg_cases <= 250,
                                   11, avg_cases_group),
         avg_cases_group = ifelse(avg_cases > 250,
                                   12, avg_cases_group),
         avg_cases_group = ifelse(avg_cases <= 0 | is.na(avg_cases),
                                   0, avg_cases_group),
         ) %>% mutate(
           avg_cases_group = 
             recode(avg_cases_group, `0` = "Few or no cases", `1` = "1 - 10",
                    `2` = "11 - 20", `3` = "21 - 30", `4` = "31 - 40", 
                    `5` = "41 - 50", `6` = "51 - 60", `7` = "61 - 70",
                    `8` = "71 - 85", `9` = "86 - 100", `10` = "101 - 175",
                    `11` = "176 - 250", `12` = "> 250")
         ) %>%
      mutate(avg_cases_group = factor(avg_cases_group, 
            levels = c("Few or no cases", "1 - 10", "11 - 20", "21 - 30", "31 - 40", 
            "41 - 50", "51 - 60", "61 - 70", "71 - 85", "86 - 100", "101 - 175",
            "176 - 250",  "> 250")))
  

head(nyt_task3_2)
## # A tibble: 6 × 12
## # Groups:   region, subregion [1]
##   date      subre…¹ region cases new_c…² avg_c…³  long   lat group order popul…⁴
##   <chr>     <chr>   <chr>  <int>   <int>   <dbl> <dbl> <dbl> <dbl> <int>   <int>
## 1 2022-01-… autauga alaba… 12180      78    180. -86.5  32.3     1     1   55869
## 2 2022-01-… autauga alaba… 12180      78    180. -86.5  32.4     1     2   55869
## 3 2022-01-… autauga alaba… 12180      78    180. -86.5  32.4     1     3   55869
## 4 2022-01-… autauga alaba… 12180      78    180. -86.6  32.4     1     4   55869
## 5 2022-01-… autauga alaba… 12180      78    180. -86.6  32.4     1     5   55869
## 6 2022-01-… autauga alaba… 12180      78    180. -86.6  32.4     1     6   55869
## # … with 1 more variable: avg_cases_group <fct>, and abbreviated variable names
## #   ¹​subregion, ²​new_cases, ³​avg_cases, ⁴​population

Mapping US state counties is possible using the maps package by using map_data("county"). Here we choose a red outline and no fill color. You will need to fill the counties with the average number of daily cases per capita and can change the outline color to white.

AllCounty <- map_data("county")
AllCounty %>% ggplot(aes(x = long, y = lat, group = group)) +
              geom_polygon(color = "red", fill = NA, size = .1 )

My Code for task 3: create the map

plot_task3 <- nyt_task3_2 %>% ggplot(aes(x = long, y = lat, group = group)) +
              #geom_polygon(color = "red", fill = NA, size = .1 ) +
geom_polygon(color = "white", aes(fill =avg_cases_group), size = 0,
             data = nyt_task3_2)  +
  theme_void() +
  labs(title = "Hot spots",
       subtitle = "Average daily cases per 100,000 people in past week of 2022-01-12") +
  scale_fill_manual(
    values = c("Few or no cases" = "#ecebe3", "1 - 10"= "#f2df91",
    "11 - 20" = "#f9c468", "21 - 30" = "#ffa83d", "31 - 40" = "#ff8b23", 
    "41 - 50" = "#fd6a0d",  "51 - 60" = "#f04f09", "61 - 70" = "#d8382e",
    "71 - 85" = "#c62833", "86 - 100" = "#af1b43", "101 - 175" = "#8a1739",
    "176 - 250" = "#701447", "> 250" = "#4c0e3e")) +
  theme(legend.title = element_blank(),
        plot.background = element_rect(fill = "white"),
        plot.margin = margin(1,1,1,1, "cm"))
  
plot_task3

# save the plot
ggsave(filename = "plot_task3.png", plot = plot_task3, 
       path = "reproduced_figures", 
       width = 20,
       height = 12,
       units = c("cm"))

Task #4

Create the table of cases by state - the second table on the page (do not need to include per 100,000, 14-day change, or fully vaccinated columns).

Task #5

Provide a brief critique of the reproducibility of the figures and tables you created in tasks 1-4.